Financial Contributions to Presidential Campaign by SOHAN SAMANTA

INTRODUCTION

For this Exploratory Data Analysis project, I have decided to work with one of the Udacity-suggested data sets dealing with the financial contributions to the presidential election of 2016. The data sets are divided by state, and here I have chosen to look at the state of Illinois. The choice of Illinois is random, and the primary motivation behind it is to start from neutral ground without any prior expectations of certain outcomes or conclusions. As data scientists, we are supposed to draw inferences from the available data, not the other way round. I start this project with a few initial questions in mind and try to answer them as I progress:

  1. Contributions to parties and candidates are a direct representation of the people’s support. So who garners the most support? Which party is in the lead, and which candidates from each party are the strongest contenders?

  2. What demographic conclusions can we draw about the donors from the donations?

  3. What geographical aspects do these donations show? Is there any relation between the population of a place and the number of donations?

  4. How do the contributions vary over time?

THE DATASET

Let us first take a look at some of the data.

## 'data.frame':    250411 obs. of  18 variables:
##  $ cmte_id          : chr  "C00575795" "C00580100" "C00577130" "C00580100" ...
##  $ cand_id          : Factor w/ 24 levels "P00003392","P20002671",..: 1 23 12 23 12 12 12 1 1 1 ...
##  $ cand_nm          : Factor w/ 24 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 22 19 22 19 19 19 4 4 4 ...
##  $ contbr_nm        : Factor w/ 52686 levels " PLOEG, DAVID VANDER",..: 15057 38033 27566 36136 28537 22091 22091 11408 3585 18652 ...
##  $ contbr_city      : Factor w/ 1521 levels "0LMSTED","25 F CHICAGO",..: 715 617 243 243 1081 243 243 201 1489 1007 ...
##  $ contbr_st        : Factor w/ 1 level "IL": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : int  601564641 60192 606223057 60622 615542542 606576225 606576225 622311258 600911971 619430433 ...
##  $ contbr_employer  : Factor w/ 17536 levels "","-","--",".",..: 13786 13862 17462 7625 6922 11229 11229 7625 10075 10695 ...
##  $ contbr_occupation: Factor w/ 8363 levels ""," CERTIFIED REGISTERED NURSE ANESTHETIS",..: 6565 1220 8239 3643 6129 4888 4888 3643 6536 6338 ...
##  $ contb_receipt_amt: num  33.5 400 50 601.6 15 ...
##  $ contb_receipt_dt : Date, format: "2016-04-17" "2016-11-19" ...
##  $ receipt_desc     : Factor w/ 32 levels "","* EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_cd          : Factor w/ 2 levels "","X": 2 2 1 2 1 1 1 2 2 2 ...
##  $ memo_text        : Factor w/ 126 levels "","*","* EARMARKED CONTRIBUTION: SEE BELOW",..: 8 1 3 1 3 3 3 8 8 8 ...
##  $ form_tp          : Factor w/ 3 levels "SA17A","SA18",..: 2 2 1 2 1 1 1 2 2 2 ...
##  $ file_num         : int  1091718 1146165 1077404 1146165 1077404 1077404 1077404 1091718 1091718 1091718 ...
##  $ tran_id          : Factor w/ 249967 levels "A0026BD64F7D64A08896",..: 85172 172660 201960 172648 201573 201985 202122 85468 85069 84821 ...
##  $ election_tp      : Factor w/ 5 levels "","G2016","O2016",..: 4 2 4 2 4 4 4 4 4 4 ...

UPDATE THE DATA

We notice at once that a few pieces needed to answer our questions are missing. The most important gap is the political party: the data set does not say which candidate is affiliated with which party. Other pieces are missing too, such as the coordinates of the contributors and the county population figures, but we will look at those at a later stage.

For now, let us insert the information into our data set and try to summarise the data.

##     cmte_id   cand_id                 cand_nm         contbr_nm
## 1 C00575795 P00003392 Clinton, Hillary Rodham    FUCHS, LEE ANN
## 2 C00580100 P80001571        Trump, Donald J.  PUSATERI, THOMAS
## 3 C00577130 P60007168        Sanders, Bernard   LIEBIG, BRANDON
## 4 C00580100 P80001571        Trump, Donald J. PAVLJASEVIC, MARK
## 5 C00577130 P60007168        Sanders, Bernard   LYBARGER, ANGEL
## 6 C00577130 P60007168        Sanders, Bernard   JAKABHAZY, LYRA
##         contbr_city contbr_st contbr_zip           contbr_employer
## 1 LAKE IN THE HILLS        IL  601564641     SEARS, ROEBUCK AND CO
## 2   HOFFMAN ESTATES        IL      60192             SELF-EMPLOYED
## 3           CHICAGO        IL  606223057 ZACKS INVESTMENT RESEARCH
## 4           CHICAGO        IL      60622     INFORMATION REQUESTED
## 5             PEKIN        IL  615542542          HEYDE EYE CENTER
## 6           CHICAGO        IL  606576225              NOT EMPLOYED
##       contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc
## 1         SALES MANAGER             33.55       2016-04-17             
## 2          CHIROPRACTOR            400.00       2016-11-19             
## 3          WEB DESIGNER             50.00       2016-03-06             
## 4 INFORMATION REQUESTED            601.61       2016-11-25             
## 5          RECEPTIONIST             15.00       2016-03-05             
## 6          NOT EMPLOYED              5.00       2016-03-06             
##   memo_cd                           memo_text form_tp file_num     tran_id
## 1       X              * HILLARY VICTORY FUND    SA18  1091718    C4726518
## 2       X                                        SA18  1146165  SA18.78425
## 3         * EARMARKED CONTRIBUTION: SEE BELOW   SA17A  1077404 VPF7BKYXK86
## 4       X                                        SA18  1146165  SA18.77686
## 5         * EARMARKED CONTRIBUTION: SEE BELOW   SA17A  1077404 VPF7BKXYB01
## 6         * EARMARKED CONTRIBUTION: SEE BELOW   SA17A  1077404 VPF7BKZ0NG7
##   election_tp      party
## 1       P2016 democratic
## 2       G2016 republican
## 3       P2016 democratic
## 4       G2016 republican
## 5       P2016 democratic
## 6       P2016 democratic
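
The party column shown above is not part of the raw file, so it has to be derived from the candidate name. A minimal sketch of one way to do this with a lookup vector follows. The names `contributions` and `party_lookup` are illustrative and not necessarily those used in the original analysis, and the toy rows stand in for the full data set.

```r
# Toy stand-in for the contributions data frame (column names as in str() above).
contributions <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Trump, Donald J.",
              "Sanders, Bernard", "Johnson, Gary", "Stein, Jill"),
  contb_receipt_amt = c(33.55, 400, 50, 25, 10),
  stringsAsFactors = FALSE
)

# Candidates outside the Republican field; every other candidate in this
# data set ran as a Republican.
party_lookup <- c(
  "Clinton, Hillary Rodham" = "democratic",
  "Sanders, Bernard"        = "democratic",
  "O'Malley, Martin Joseph" = "democratic",
  "Lessig, Lawrence"        = "democratic",
  "Webb, James Henry Jr."   = "democratic",
  "Johnson, Gary"           = "libertarian",
  "Stein, Jill"             = "green",
  "McMullin, Evan"          = "independent"
)

# Look each candidate up by name; names missing from the lookup come back
# as NA and are filled in as republican.
party <- unname(party_lookup[as.character(contributions$cand_nm)])
party[is.na(party)] <- "republican"
contributions$party <- party

print(contributions)
```

Defaulting unmatched names to one party is only safe because the candidate list is small and known in advance; with an open-ended list, an explicit entry per candidate would be safer.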

SUMMARY OF THE DATA SET

##  [1] "cmte_id"           "cand_id"           "cand_nm"          
##  [4] "contbr_nm"         "contbr_city"       "contbr_st"        
##  [7] "contbr_zip"        "contbr_employer"   "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt"  "receipt_desc"     
## [13] "memo_cd"           "memo_text"         "form_tp"          
## [16] "file_num"          "tran_id"           "election_tp"      
## [19] "party"

So we have 19 variables and 250,411 observations. That’s a lot of data!

##  [1] Clinton, Hillary Rodham   Trump, Donald J.         
##  [3] Sanders, Bernard          O'Malley, Martin Joseph  
##  [5] Cruz, Rafael Edward 'Ted' Walker, Scott            
##  [7] Bush, Jeb                 Rubio, Marco             
##  [9] Christie, Christopher J.  Kasich, John R.          
## [11] Johnson, Gary             Paul, Rand               
## [13] Graham, Lindsey O.        Fiorina, Carly           
## [15] Jindal, Bobby             Santorum, Richard J.     
## [17] Huckabee, Mike            Stein, Jill              
## [19] Carson, Benjamin S.       Lessig, Lawrence         
## [21] Perry, James R. (Rick)    Pataki, George E.        
## [23] McMullin, Evan            Webb, James Henry Jr.    
## 24 Levels: Bush, Jeb Carson, Benjamin S. ... Webb, James Henry Jr.
## [1] "democratic"  "republican"  "libertarian" "green"       "independent"

We have 24 candidates, representing 5 political parties, who received contributions from the state of Illinois. The data also give each contributor’s name, employer, occupation and contribution amounts, along with the zip code and city. These are the pieces of information that jump out at me at first glance, and I can start my work with them to get some idea of how they relate.

Univariate Plots Section

Let’s start with a quick look at the cumulative contributions received by the individual candidates. We would also like some way to distinguish the candidates by their parties, so we color code them by party.

From this simple plot alone we can identify the front runners of this election. If we believe that contributions are a direct reflection of the people’s choice, then Hillary Clinton is leading the race by a large margin. Donald Trump comes second and Bernie Sanders a close third. The Democratic party seems to be leading with Hillary and Bernie, with the Republicans following in second place. Let us take a look at only the party information, without the candidates separated.

So, in Illinois, the Democrats get more donations than any other party: more than twice as much as the Republicans. And even then, Hillary Clinton gets the major share of the donations that the whole party receives. The independent, libertarian and green parties have such small amounts of contributions that it will be difficult to draw conclusions from their data. This is one of the reasons why, in the later parts of this project, we will focus mainly on the two major parties (Republican and Democratic) and ignore the rest.
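
The candidate and party totals behind these two bar charts can be computed along the following lines. This is a sketch in base R on six toy rows taken from the head of the data shown earlier; the object names (`by_candidate`, `by_party`, `party_cols`) and the colours are illustrative choices, and the original plots may well have been drawn with a different plotting library.

```r
contributions <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Trump, Donald J.", "Sanders, Bernard",
              "Trump, Donald J.", "Sanders, Bernard", "Sanders, Bernard"),
  party = c("democratic", "republican", "democratic",
            "republican", "democratic", "democratic"),
  contb_receipt_amt = c(33.55, 400, 50, 601.61, 15, 5),
  stringsAsFactors = FALSE
)

# Total received by each candidate, keeping the party for colour coding.
by_candidate <- aggregate(contb_receipt_amt ~ cand_nm + party,
                          data = contributions, FUN = sum)

# Total received by each party.
by_party <- aggregate(contb_receipt_amt ~ party,
                      data = contributions, FUN = sum)

# Bar chart of candidate totals, one colour per party.
party_cols <- c(democratic = "steelblue", republican = "firebrick")
barplot(by_candidate$contb_receipt_amt,
        names.arg = by_candidate$cand_nm,
        col = party_cols[by_candidate$party],
        las = 2, ylab = "Total contributions (USD)")

print(by_party)
```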

Another idea as to the relative standing of a candidate in his or her own party is the percentage of contribution they received against the total contributions received by the party.

Let us take a quick look at the percentage of donations the members receive.
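
A minimal sketch of how such within-party percentages could be computed, using the Democratic candidate totals reported in the summary table of the Democratic Party section. The data frame name `dem` and the column `pct_of_party` are illustrative.

```r
dem <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Lessig, Lawrence",
              "O'Malley, Martin Joseph", "Sanders, Bernard",
              "Webb, James Henry Jr."),
  party  = "democratic",
  amount = c(16152026.77, 21393.89, 127137.00, 3295108.27, 8150.00),
  stringsAsFactors = FALSE
)

# ave() repeats the party total on every row of that party, so the same
# line works unchanged when several parties are stacked in one data frame.
dem$pct_of_party <- 100 * dem$amount / ave(dem$amount, dem$party, FUN = sum)

print(dem[order(-dem$pct_of_party), c("cand_nm", "pct_of_party")])
```

With these totals Clinton's share comes out at roughly 82% of the party's contributions.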

The percentage of donations received tells us who is leading the financial race, but it gives us no idea of the size of the individual contributions.

To get a better sense of the individual contributions, we can draw box plots. We do this one party at a time, since we are currently interested in how one candidate stacks up against another within a party and in what we can deduce from the donations they receive.

Democratic Party

##              contbr_nm                        cand_nm     
##  STUCKEY, RICHARD :    3   Clinton, Hillary Rodham:19776  
##  ABELSON, BENJAMIN:    2   Sanders, Bernard       : 9437  
##  ADAMLE, KIM      :    2   O'Malley, Martin Joseph:  106  
##  ADAMS, THOMAS    :    2   Lessig, Lawrence       :   30  
##  ADKINS, JAMES    :    2   Webb, James Henry Jr.  :    8  
##  (Other)          :29346   (Other)                :    0  
##     party           total_contribution       n          
##  Length:29357       Min.   :-5000.0    Min.   :  1.000  
##  Class :character   1st Qu.:  120.0    1st Qu.:  1.000  
##  Mode  :character   Median :  260.0    Median :  3.000  
##                     Mean   :  667.8    Mean   :  6.791  
##                     3rd Qu.:  667.7    3rd Qu.:  8.000  
##                     Max.   : 8100.0    Max.   :706.000

Hillary Clinton has the largest number of contributors at 19,776. She also has some of the richest contributors. The maximum contribution of $8100 from the summary corresponds to a donation received by Hillary.

Martin O’Malley, on the other hand, has a wider range of donations, and the median amount he received per contributor is also larger, at around $750.
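
The contributor-level summary and box plot discussed above rest on collapsing the raw rows to one row per contributor and candidate. The following is a sketch of that step in base R. The donor names and amounts are invented for illustration, and `per_contributor` is a hypothetical name; only the output columns `total_contribution` and `n` mirror the summary shown above.

```r
# Toy Democratic contributions; donor names and amounts are invented.
dem <- data.frame(
  contbr_nm = c("DONOR A", "DONOR A", "DONOR B", "DONOR C", "DONOR C", "DONOR D"),
  cand_nm   = c("Clinton, Hillary Rodham", "Clinton, Hillary Rodham",
                "Clinton, Hillary Rodham", "Sanders, Bernard",
                "Sanders, Bernard", "O'Malley, Martin Joseph"),
  contb_receipt_amt = c(100, 150, 25, 27, 15, 750),
  stringsAsFactors = FALSE
)

# Sum and count the contributions of each contributor to each candidate.
totals <- aggregate(contb_receipt_amt ~ contbr_nm + cand_nm, data = dem, FUN = sum)
counts <- aggregate(contb_receipt_amt ~ contbr_nm + cand_nm, data = dem, FUN = length)
names(totals)[3] <- "total_contribution"
names(counts)[3] <- "n"
per_contributor <- merge(totals, counts, by = c("contbr_nm", "cand_nm"))

# One box per candidate, drawn from the per-contributor totals.
boxplot(total_contribution ~ cand_nm, data = per_contributor,
        xlab = "", ylab = "Total per contributor (USD)")

print(per_contributor)
```

Because refunds appear as negative amounts in the data (the minimum in the summary is negative), a contributor's total can drop below the sum of their donations.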

Although the visual range of the donations gives us an idea about the distributions, we should not draw conclusions from it about the total amounts received. We should always look at the numbers. A summary of the democratic party would look like the following:

## # A tibble: 5 x 6
## # Groups:   party [1]
##        party                 cand_nm      amount      mean median      n
##        <chr>                  <fctr>       <dbl>     <dbl>  <dbl>  <int>
## 1 democratic Clinton, Hillary Rodham 16152026.77 131.70921     25 122634
## 2 democratic        Lessig, Lawrence    21393.89 329.13677    200     65
## 3 democratic O'Malley, Martin Joseph   127137.00 775.22561    500    164
## 4 democratic        Sanders, Bernard  3295108.27  43.07218     27  76502
## 5 democratic   Webb, James Henry Jr.     8150.00 815.00000    500     10
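
A table of this shape (total amount, mean, median and number of contributions per candidate) can be produced in several ways. The tibble header above suggests a grouped summary from the dplyr package; the sketch below reaches the same columns with base R only, on toy rows taken from the head of the data. The helper `summarise_candidate` is a hypothetical name.

```r
contributions <- data.frame(
  party   = c("republican", "democratic", "republican", "democratic", "democratic"),
  cand_nm = c("Trump, Donald J.", "Sanders, Bernard", "Trump, Donald J.",
              "Sanders, Bernard", "Sanders, Bernard"),
  contb_receipt_amt = c(400, 50, 601.61, 15, 5),
  stringsAsFactors = FALSE
)

# Summarise the rows belonging to a single candidate.
summarise_candidate <- function(df) {
  data.frame(
    party   = df$party[1],
    cand_nm = df$cand_nm[1],
    amount  = sum(df$contb_receipt_amt),
    mean    = mean(df$contb_receipt_amt),
    median  = median(df$contb_receipt_amt),
    n       = nrow(df),
    stringsAsFactors = FALSE
  )
}

# Split by candidate (dropping unused factor levels), summarise, and stack.
pieces <- split(contributions, contributions$cand_nm, drop = TRUE)
candidate_summary <- do.call(rbind, lapply(pieces, summarise_candidate))
rownames(candidate_summary) <- NULL

print(candidate_summary)
```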

Republican party

##                     contbr_nm                          cand_nm     
##  ALSAMMARAE, AIHAM       :    6   Trump, Donald J.         :17349  
##  BOARDMAN, LUANN         :    5   Cruz, Rafael Edward 'Ted': 2646  
##  GLEESON, JOHN           :    5   Carson, Benjamin S.      : 1596  
##  ALDRICH, ELIZABETH      :    4   Rubio, Marco             : 1078  
##  AYERS, HELEN Z. MRS.    :    4   Kasich, John R.          :  602  
##  BUNTROCK, DEAN L. MR.   :    4   Bush, Jeb                :  403  
##  BUNTROCK, ROSEMARIE MRS.:    4   Walker, Scott            :  338  
##  COSTELLO, KEVIN         :    4   Paul, Rand               :  325  
##  HUGHES, LOUIS           :    4   Fiorina, Carly           :  243  
##  KAPP, DUANE             :    4   Christie, Christopher J. :  103  
##  KENNY, JAMES            :    4   Graham, Lindsey O.       :   50  
##  LEE, GAYLE              :    4   Huckabee, Mike           :   45  
##  MACNEIL, DAVID          :    4   Santorum, Richard J.     :   14  
##  MERRILL, HAROLD         :    4   Pataki, George E.        :    5  
##  MOON, DANIEL            :    4   Jindal, Bobby            :    3  
##  (Other)                 :24738   (Other)                  :    2  
##     party           total_contribution       n          
##  Length:24802       Min.   :-12700.0   Min.   :  1.000  
##  Class :character   1st Qu.:    40.0   1st Qu.:  1.000  
##  Mode  :character   Median :    80.0   Median :  1.000  
##                     Mean   :   342.8   Mean   :  2.022  
##                     3rd Qu.:   280.3   3rd Qu.:  2.000  
##                     Max.   : 18900.0   Max.   :138.000  

The republican party has many more candidates, and their distributions are much more spread out. But again, to draw better conclusions we need to look at the numbers.

## # A tibble: 16 x 6
## # Groups:   party [1]
##         party                   cand_nm     amount       mean  median
##         <chr>                    <fctr>      <dbl>      <dbl>   <dbl>
##  1 republican                 Bush, Jeb  640064.00 1014.36450  300.00
##  2 republican       Carson, Benjamin S.  638877.63  106.78215   50.00
##  3 republican  Christie, Christopher J.  162273.00 1277.74016 1000.00
##  4 republican Cruz, Rafael Edward 'Ted' 1241112.74   92.85596   50.00
##  5 republican            Fiorina, Carly  146237.50  215.68953  100.00
##  6 republican        Graham, Lindsey O.   86510.00  920.31915  500.00
##  7 republican            Huckabee, Mike   18500.20  114.90807   50.00
##  8 republican             Jindal, Bobby    5500.00 1833.33333 2700.00
##  9 republican           Kasich, John R.  539898.25  516.64904  250.00
## 10 republican         Pataki, George E.    4200.00  700.00000  250.00
## 11 republican                Paul, Rand  196795.16  168.05735   50.00
## 12 republican    Perry, James R. (Rick)    1000.00  500.00000  500.00
## 13 republican              Rubio, Marco  964158.94  322.78505  100.00
## 14 republican      Santorum, Richard J.   11322.46  419.35037  250.00
## 15 republican          Trump, Donald J. 3490011.12  149.35641   55.13
## 16 republican             Walker, Scott  355916.00  711.83200  250.00
## # ... with 1 more variables: n <int>

Taking both parties into perspective, one thing is pretty clear: Hillary Clinton is leading the race. But the numbers for Trump and Bernie are quite similar, and it will be interesting to see how they compare against each other.

Let’s first look at their summaries.

## # A tibble: 1 x 6
## # Groups:   party [1]
##        party          cand_nm  amount     mean median     n
##        <chr>           <fctr>   <dbl>    <dbl>  <dbl> <int>
## 1 republican Trump, Donald J. 3490011 149.3564  55.13 23367
## # A tibble: 1 x 6
## # Groups:   party [1]
##        party          cand_nm  amount     mean median     n
##        <chr>           <fctr>   <dbl>    <dbl>  <dbl> <int>
## 1 democratic Sanders, Bernard 3295108 43.07218     27 76502

OK, so the total amounts look close enough. But that does not give us a clear idea of how their contributions are distributed. For a visual comparison, let’s plot a frequency polygon (freqpoly) of the amounts received by the two candidates.
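
A sketch of such a frequency polygon with ggplot2 is given below. The amounts are toy values (a few are borrowed from the rows and medians shown earlier, the rest are invented), the log-scaled x axis is an assumption made here to cope with the skew, and refunds are dropped because negative amounts cannot be shown on a log scale.

```r
library(ggplot2)

two <- data.frame(
  cand_nm = rep(c("Trump, Donald J.", "Sanders, Bernard"), each = 5),
  contb_receipt_amt = c(400, 601.61, 55.13, 80, -100,
                        50, 15, 5, 27, 27),
  stringsAsFactors = FALSE
)

# Keep positive amounts only: refunds are negative and have no logarithm.
positive <- subset(two, contb_receipt_amt > 0)

p <- ggplot(positive, aes(x = contb_receipt_amt, colour = cand_nm)) +
  geom_freqpoly(bins = 20) +
  scale_x_log10() +
  labs(x = "Contribution amount (USD, log scale)",
       y = "Number of contributions",
       colour = "Candidate")

print(p)
```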

CONTRIBUTORS

We start off by plotting a histogram of the number of donors by the amount they donated.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -12700.00     69.07    200.00    519.05    500.00  18900.00

As expected, we find a larger number of people in the lower range. Also noticeable are the peaks at certain figures like 250, 500, 600 and so on. These are also within the bounds of our expectations: people are more likely to donate $600 than, say, $610 or $590. Hence the peaks.
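
The peaks at round figures can also be checked numerically by counting how often each exact amount occurs. A small sketch with invented amounts:

```r
# Invented amounts, chosen to mimic the clustering at round figures.
amounts <- c(250, 500, 600, 250, 33.55, 500, 250, 610, 15, 500, 250)

# Frequency of each distinct amount, most common first.
amount_counts <- sort(table(amounts), decreasing = TRUE)
print(head(amount_counts, 3))

# Narrow bins keep neighbouring amounts such as 600 and 610 apart.
hist(amounts, breaks = seq(0, 650, by = 10),
     main = "", xlab = "Amount donated (USD)", ylab = "Number of donors")
```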

SCATTER PLOT (contributors vs contribution)

Another perspective of the same data may be observed by plotting a scatter plot of the individual contributors vs amount donated and then classifying them with the party they contributed to.

This graph gives some interesting results.

First, we see the stark disparity between the democratic and republican parties on one side and the other parties on the other. The number of individuals contributing to the 3 lesser-known political parties is almost insignificant next to the leading ones.

Corresponding to the peaks we found in the earlier histogram, we also find parallel lines of dots, formed by the larger number of contributors at certain amounts.

MEDIAN TREND

Although illuminating, the above graphs don’t really show a trend in how much contributors are willing to give to their parties. An interesting plot from this perspective is the median amount received by each party against the number of people donating that median amount.

TIME FRAME

An important feature that we have neglected to explore all this time has been the time data. There are four years between elections, and the donations happen over time throughout these four years.

It will be interesting to note how the contributions vary over time for the individual parties.

This plot is not gradual in its trend at all. It is mostly comprised of sudden peaks of donations and then gaps of nil or very small donations. This is interesting. The reason, without going into more details or data hunting on the web, may be that these donations correspond with certain political events or decisions by the parties that influenced the supporters to show their support in the form of donations.

The trend of the donations increasing towards the end is very logical and expected. It is obvious that more people tend to donate as the presidential election date nears and they have had a chance to make up their mind as to who they want to support.
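
One way to build the series behind such a time plot is to bucket the receipt dates by month and sum the amounts per party. The sketch below uses the six rows from the head of the data; the monthly granularity is an assumption, as the original plot may use daily or weekly totals.

```r
contributions <- data.frame(
  party = c("democratic", "republican", "democratic",
            "republican", "democratic", "democratic"),
  contb_receipt_dt = as.Date(c("2016-04-17", "2016-11-19", "2016-03-06",
                               "2016-11-25", "2016-03-05", "2016-03-06")),
  contb_receipt_amt = c(33.55, 400, 50, 601.61, 15, 5),
  stringsAsFactors = FALSE
)

# Label every contribution with its calendar month, e.g. "2016-03".
contributions$month <- format(contributions$contb_receipt_dt, "%Y-%m")

# Total received per party and month.
monthly <- aggregate(contb_receipt_amt ~ month + party,
                     data = contributions, FUN = sum)

print(monthly[order(monthly$party, monthly$month), ])
```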

We can do the same thing for individual candidates that we have done above for the political parties as a whole.

Here we only take a closer look at the democratic and republican parties.

GEOGRAPHICAL ASPECTS

We know the zip codes of the contributors. So it would be interesting to see which places people are contributing from and who they are contributing to. We could plot the different locations that people are contributing from on a map.

If we are lucky, and there are some places which are overwhelmingly in favour of one party or the other, we would see clear distinctions of that as well.

For starters, let us look at the map of USA populated by the donors according to their zip codes converted to longitudes and latitudes.

For the state of Illinois, the donors are of course mostly located within the state. But we do find some donors outside of the state lines.

Could they be outliers? Is this information that was wrongly entered into the system? It could be. Another explanation may be people who are from Illinois but are temporarily located elsewhere and are donating from their temporary locations. Or perhaps some organisations are donating money to a local candidate through institutions that hold their money and are located outside the state.

In any case, with no further information, it is difficult for us to come to any conclusion. In the meantime, let us set aside the locations outside the state borders and take a closer look at the contributions from within the state.
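
A possible way to separate in-state from out-of-state records without any coordinates is to work on the zip code itself. The sketch below first reduces the mixed 5-digit and 9-digit (ZIP+4) integers seen in `contbr_zip` to 5-digit strings, then keeps the ones whose prefix falls in the Illinois range. It relies on the assumption that Illinois zip codes begin with 60, 61 or 62; the last toy value is an out-of-state (New York) code added for contrast.

```r
# First four values appear in the data shown earlier; the last is added here.
zip_raw <- c(601564641L, 60192L, 606223057L, 60622L, 100011234L)

# Values above 99999 are ZIP+4 codes: pad to nine digits and keep the
# first five. Shorter values are padded to five digits so that a leading
# zero dropped by the integer type is restored.
zip5 <- ifelse(zip_raw > 99999L,
               substr(sprintf("%09d", zip_raw), 1, 5),
               sprintf("%05d", zip_raw))

# Assumption: Illinois zip codes start with 60, 61 or 62.
in_state <- substr(zip5, 1, 2) %in% c("60", "61", "62")

print(data.frame(zip_raw, zip5, in_state))
```

Filtering on the prefix is a different technique from filtering on the plotted coordinates, so the two approaches may disagree for a handful of records.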

To our disappointment, the distribution of donors does not show much character. They are more or less evenly distributed. What remains to be seen is how the distribution varies once we add in the amount that each of these places contributed.

CONTRIBUTIONS SCALED BY DOT SIZE

This gives us a better idea as to which locations contributed more money and those that had little to contribute. There are some places that are completely blank. These could be places with poor economic standing or less population density.

CONTRIBUTIONS BY COUNTY

Now, the above visualization has considered contributions from particular locations, as in zip codes to be exact. We can group these and view them as contributions from different counties as well.

But these totals are not relative to anything. They only show how the total contributions differ from one county to another.

To add the population factor here, we need to look at the ratio of contribution to population for each county.
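
The ratio itself is a join between the county totals and a county population table, followed by a division. A minimal sketch follows; the county names, totals and populations are all invented placeholders, not real Illinois figures.

```r
# Invented county totals (USD) and populations, for illustration only.
county_totals <- data.frame(
  county = c("County A", "County B", "County C"),
  amount = c(50000, 12000, 300),
  stringsAsFactors = FALSE
)
county_pop <- data.frame(
  county     = c("County A", "County B", "County C", "County D"),
  population = c(100000, 8000, 1500, 4000),
  stringsAsFactors = FALSE
)

# all.y = TRUE keeps counties that have a population but no contributions.
ratio <- merge(county_totals, county_pop, by = "county", all.y = TRUE)
ratio$amount[is.na(ratio$amount)] <- 0

# Contribution per resident.
ratio$amount_per_capita <- ratio$amount / ratio$population

print(ratio[order(-ratio$amount_per_capita), ])
```

In this toy example County B gives far less in total than County A but three times as much per resident, which is exactly the kind of difference the raw totals hide.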

Ratio of Contribution to population per county

A time frame added to geographical plot

We have already seen one time frame plot. We now add that data to the map above and look at how people from different locations have donated over time, from a geographical point of view.

CONTRIBUTORS BY POPULATION

We have looked at the ratio of contribution to population from a geographical perspective. But the contribution of an individual depends on many factors, chief among them his or her economic standing.

The ratio of amount divided by population does not necessarily portray a direct correlation with the number of people supporting a candidate. For this we need to look not at the amount of contribution, but at the number of contributors with respect to the population.
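
Counting contributors rather than dollars means each person should be counted once per county, however many times they donated. A short sketch with invented donors and populations:

```r
# Invented donors and counties; DONOR A gave twice but is one person.
donations <- data.frame(
  contbr_nm = c("DONOR A", "DONOR A", "DONOR B", "DONOR C"),
  county    = c("County A", "County A", "County A", "County B"),
  stringsAsFactors = FALSE
)
population <- c("County A" = 100000, "County B" = 8000)

# Number of distinct contributors in each county.
n_contributors <- tapply(donations$contbr_nm, donations$county,
                         function(x) length(unique(x)))

# Contributors per 1,000 residents, matched to the population by county name.
per_1000 <- 1000 * n_contributors / population[names(n_contributors)]

print(per_1000)
```

Matching on the raw `contbr_nm` string is itself an assumption: the same person may appear under slightly different spellings, and different people may share a name.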

No surprise here. From our earlier explorations, we expected a result close to this.

CONTRIBUTIONS BY OCCUPATION

A last aspect that I would like to touch on before wrapping up is the occupation of the contributors. There are 8364 distinct occupations in total, so we won’t look at all of them.

Instead we would select a few occupations that have the most number of contributors and study who they contributed towards.
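
Selecting the most common occupations and cross-tabulating them by party could look like the sketch below. The toy rows reuse occupation strings that appear in the data shown earlier, with invented repetition. Dropping the placeholder values is a choice made here, and the counts are rows (contributions), not distinct contributors.

```r
contributions <- data.frame(
  contbr_occupation = c("NOT EMPLOYED", "NOT EMPLOYED", "NOT EMPLOYED",
                        "WEB DESIGNER", "WEB DESIGNER", "RECEPTIONIST",
                        "INFORMATION REQUESTED", "INFORMATION REQUESTED"),
  party = c("democratic", "democratic", "republican",
            "democratic", "republican", "democratic",
            "republican", "republican"),
  stringsAsFactors = FALSE
)

# Placeholder entries carry no occupation information, so set them aside.
placeholders <- c("", "INFORMATION REQUESTED")
known <- contributions[!contributions$contbr_occupation %in% placeholders, ]

# The two most frequent occupations (the cut-off of two is arbitrary).
occ_counts <- sort(table(known$contbr_occupation), decreasing = TRUE)
top_occ <- names(occ_counts)[1:2]

# Contributions from those occupations, split by party.
top <- known[known$contbr_occupation %in% top_occ, ]
occ_by_party <- table(top$contbr_occupation, top$party)

print(occ_by_party)
```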

## [1] 8364

——————————————————————————-

Final Plots and Summary

This data set is essentially a log of how much money individuals and institutions in the state of Illinois contributed to the candidates of the different political parties contesting the Presidential Election of 2016.

One of the initial plots, which in my view was also one of the most significant, is the one given below:

Plot One

Description One

The above graphs make clear both the immense disparity between the amounts donated to each of the parties and the candidate who is the main reason for this huge margin. Hillary Clinton not only commands a little above 80% of the total funds collected by the democratic party, she also towers over the other candidates, leading the runner-up, Donald Trump, by about 12.7 million dollars.

This difference is significant, and gives a clear view of the political favourite in the state of Illinois, at least as far as donating citizens are concerned.

Plot Two

Description Two

This is an interesting plot and gives us a good idea of the variation of the contributions. It is interesting to note that the curves for three candidates, Clinton, Cruz and Sanders, peak at the same median amount. This suggests that the typical contributors to these three candidates probably have comparable economic standing. Trump, however, has a median amount a little higher than theirs. It can also be seen that both Bush and Rubio have peaks at higher median values, but their peaks are so small that the number of people behind them is, quite frankly, not of much significance on the grand scale of things.

Plot Three

Description Three

This color scale across counties, representing the contributions per county divided by the population of that county, is interesting. We see a very clear picture of which counties contributed more per person and which did not. Most significant are the grey counties, which represent no contributions whatsoever. This could be put down to outliers caused by missing zip codes for contributors, but even then the discrepancy is too large to be a coincidence. That leaves two explanations: either the contributions from these regions are too low, or the population is so high that the small contributions become negligible.

Reflection

I approached this project by first stating a few questions that I felt one would want answered from a financial data set relating to a presidential election. As I thought about answering these questions, I considered the information required and the information available in the data set. I then proceeded to answer them by first looking at the data structure and summary, then selecting variables from these summaries and working on them. After each plot and analysis, I revisited my questions and came up with new ones. My approach was to go over the data and find correlations between the variables I was working with. Once a relation was established, I would dig deeper to see if more inferences could be drawn, sometimes by adding new data and sometimes by removing some. The project took me more time to complete than I had initially expected. While working, I found myself continually revisiting my earlier work and making small adjustments based on new information uncovered at a later stage.

Sources

  1. Main dataset from the Federal Election Commission
  2. Stack Overflow for ideas
  3. Population by county from web
  4. Zip code by county from web